Automatic Document Categorization for Highly Nuanced Topics in Massive-Scale Document Collections: The SPEED BIN Program
نویسنده
چکیده
This whitepaper offers a brief introduction to the BIN system of the Social, Political and Economic Event Database (SPEED) project. BIN provides automatic document categorization of highly nuanced topics across massive-scale document archives. The BIN system allows a group of trained human editors to present the computer with a relatively small collection of hand-categorized documents representing a given topic. It uses the semantic characteristics of these documents to develop a statistical model that is capable of identifying other documents on that same topic from the Cline Center global news archive, which contains tens of millions of news reports. Tests have shown that BIN has a false negative (incorrectly discarded relevant documents) rate of 1-4%. This paper outlines the basic premise and motivation behind BIN, its development, and its application to the SPEED project. CONTRIBUTOR
منابع مشابه
A survey on Automatic Text Summarization
Text summarization endeavors to produce a summary version of a text, while maintaining the original ideas. The textual content on the web, in particular, is growing at an exponential rate. The ability to decipher through such massive amount of data, in order to extract the useful information, is a major undertaking and requires an automatic mechanism to aid with the extant repository of informa...
متن کاملBackground Readings for Collection Synthesis
Automatic collection building will be key to efficient and affordable digital libraries of the future. Since hand-curated collections will not scale with the Web, effective automated techniques are highly desired. This section deals with crawling the Web to extract topics, related documents, or collections. To do that, technologies such as feature extraction, page categorization, andWeb crawlin...
متن کاملAutomatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation
Purpose: This study investigates the automatic keyword extraction from the table of contents of Persian e-books in the field of science using LDA topic modeling, evaluating their similarity with golden standard, and users' viewpoints of the model keywords. Methodology: This is a mixed text-mining research in which LDA topic modeling is used to extract keywords from the table of contents of sci...
متن کاملLatent Dirichlet Allocation for Automatic Document Categorization
In this paper we introduce and evaluate a technique for applying latent Dirichlet allocation to supervised semantic categorization of documents. In our setup, for every category an own collection of topics is assigned, and for a labeled training document only topics from its category are sampled. Thus, compared to the classical LDA that processes the entire corpus in one, we essentially build s...
متن کاملAutomatic Text Categorization and Its Applicationto Text
We develop an automatic text categorization approach and investigate its application to text retrieval. The categorization approach is derived from a combination of a learning paradigm known as instancebased learning and an advanced document retrieval technique known as retrieval feedback. We demonstrate the e ectiveness of our categorization approach using two real-world document collections f...
متن کامل